Hello everyone! Today we're using the same dataset as yesterday. Since PCA didn't do a great job of dimensionality reduction, I followed an article by a Kaggle grandmaster; the link is attached here.
Picking up from yesterday, we start from feature engineering.
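For anyone jumping in here, this is roughly how the data was loaded, as a sketch assuming the same headerless Kaggle CSVs as yesterday's post (train.csv, test.csv, trainLabels.csv):

import numpy as np
import pandas as pd

# headerless CSVs from the Kaggle competition (file names assumed from yesterday's post)
train = pd.read_csv('train.csv', header=None)
test = pd.read_csv('test.csv', header=None)
trainLabel = np.ravel(pd.read_csv('trainLabels.csv', header=None))  # flatten to a 1-D label vector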
For feature processing he first uses Gaussian mixture clustering as a form of dimensionality reduction. It is also an unsupervised method. The difference from KMeans is that KMeans picks a centroid and assigns points by distance, which is basically drawing a circle around each centroid, so in some situations KMeans cannot separate the clusters correctly. A Gaussian mixture is not a distance-based model but a distribution-based one: it assumes there is a fixed number of Gaussian distributions, with each Gaussian representing one cluster. See the explanation, and the little sketch below.
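To make the distance-based versus distribution-based difference concrete, here is a minimal toy sketch of my own (not from the referenced article): on blobs stretched into ellipses, KMeans's spherical boundaries mislabel points that a full-covariance GMM separates cleanly.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X_demo, _ = make_blobs(n_samples=500, centers=3, random_state=0)
X_demo = X_demo @ np.array([[0.6, -0.6], [-0.4, 0.8]])  # stretch the blobs into ellipses

km_labels = KMeans(n_clusters=3, random_state=0).fit_predict(X_demo)
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X_demo)
# KMeans draws spherical (distance-based) boundaries, while the GMM fits a full
# covariance matrix per component, so it can follow the elongated clusters.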
Model selection here follows the Gaussian Mixture Model Selection example from the official sklearn documentation. First he concatenates the train and test data, giving 10000 rows and 40 features.
import numpy as np

X = np.r_[train, test]  # stack train and test row-wise: (10000, 40)
print(X.shape)
The official parameter documentation is attached, and at the bottom of it is the Gaussian Mixture Model Selection example; the code below comes from that example.
from sklearn.mixture import GaussianMixture

lowest_bic = np.inf
bic = []
n_components_range = range(1, 7)
cv_types = ['spherical', 'tied', 'diag', 'full']
for cv_type in cv_types:
    for n_components in n_components_range:
        gmm = GaussianMixture(n_components=n_components, covariance_type=cv_type)
        gmm.fit(X)
        # smaller AIC/BIC is better; check which n_components (number of clusters) minimizes it.
        # The original line called gmm.aic(X), but the variable names and the official example use BIC.
        bic.append(gmm.bic(X))
        if bic[-1] < lowest_bic:
            lowest_bic = bic[-1]
            best_gmm = gmm
best_gmm.fit(X)
The BIC is lowest when the data are split into 4 clusters, so that is the best model I settle on.
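You can verify that yourself by tabulating the scores collected in the loop (a quick check I added; rows are covariance types, columns are component counts):

bic_table = pd.DataFrame(np.array(bic).reshape(len(cv_types), len(n_components_range)),
                         index=cv_types, columns=list(n_components_range))
print(bic_table)  # the loop fills this row by row: cv_type outer, n_components inner
print(best_gmm)   # the model kept by the selection loop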
Next we use the best model best_gmm to predict on our train and test sets, taking each sample's probabilities for the 4 clusters and feeding those as 4 features into our other machine-learning models to predict 0 or 1.
gmm_train = best_gmm.predict_proba(train)  # (n_train, 4) cluster probabilities
gmm_test = best_gmm.predict_proba(test)    # (n_test, 4)
X_train_1 = pd.DataFrame(gmm_train).values
X_submit = pd.DataFrame(gmm_test).values
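A quick sanity check I like to add: predict_proba returns one probability per component, so every row of the new 4-column feature matrix should sum to 1.

print(pd.DataFrame(gmm_train).head())           # 4 probability columns per sample
print(np.allclose(gmm_train.sum(axis=1), 1.0))  # expect True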
After that it's the same as the model section of the previous post: StratifiedKFold cross-validation plus GridSearchCV to find the best parameters for each model.
# imports for everything below (on sklearn versions before 1.0, HistGradientBoosting
# also needs: from sklearn.experimental import enable_hist_gradient_boosting)
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier,
                              HistGradientBoostingClassifier, StackingClassifier)
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(gmm_train, trainLabel, test_size=0.30, random_state=101)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
sk_fold = StratifiedKFold(10, shuffle=True, random_state=42)
g_nb = GaussianNB()
knn = KNeighborsClassifier()  # params: n_neighbors (default 5), weights (default 'uniform'), leaf_size (default 30)
ran_for = RandomForestClassifier()
# n_estimators: number of trees; max_depth: maximum depth, used for pruning (branches beyond it are cut)
# min_samples_leaf: used with max_depth; after a split, every child node must keep at least min_samples_leaf training samples
# bootstrap: resample the original data to build each tree's data; sampling is uniform and with replacement
log_reg = LogisticRegression()  # penalty: regularization type (default 'l2'); C: inverse regularization strength (default 1.0); solver (default 'lbfgs'), 'saga' works with every penalty
tree = DecisionTreeClassifier()
xgb = XGBClassifier()  # parameter guide: https://www.itread01.com/content/1536594984.html
ada_boost = AdaBoostClassifier()  # parameter guide: https://ask.hellobi.com/blog/zhangjunhong0428/12405
grad_boost = GradientBoostingClassifier(n_estimators=100)  # parameter guide: https://www.itread01.com/content/1514358146.html
hist_grad_boost = HistGradientBoostingClassifier()  # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html
clf = [("Naive Bayes", g_nb, {}), \
("K Nearest", knn, {"n_neighbors": [3, 5, 6, 7, 8, 9, 10], "leaf_size": [25, 30, 35]}), \
("Random Forest", ran_for,
{"n_estimators": [10, 50, 100, 200, 400], "max_depth": [3, 10, 20, 40], "random_state": [99],
"min_samples_leaf": [5, 10, 20, 40, 50], "bootstrap": [False]}), \
("Logistic Regression", log_reg, {"penalty": ['l2'], "C": [100, 10, 1.0, 0.1, 0.01], "solver": ['saga']}), \
("Decision Tree", tree, {}), \
("XGBoost", xgb,
{"n_estimators": [200], "max_depth": [3, 4, 5], "learning_rate": [.01, .1, .2], "subsample": [.8],
"colsample_bytree": [1], "gamma": [0, 1, 5], "lambda": [.01, .1, 1]}), \
\
("Adapative Boost", ada_boost, {"n_estimators": [100], "learning_rate": [.6, .8, 1]}), \
("Gradient Boost", grad_boost, {}), \
\
("Histogram GB", hist_grad_boost,
{"loss": ["binary_crossentropy"], "min_samples_leaf": [5, 10, 20, 40, 50], "l2_regularization": [0, .1, 1]})]
stack_list_gmm = []
train_scores_gmm = pd.DataFrame(columns=["Name", "Train Score", "Test Score"])
i = 0
for name, clf1, param_grid in clf:
    # renamed the loop-local variable from clf to grid so it no longer shadows the clf list above
    grid = GridSearchCV(clf1, param_grid=param_grid, scoring="accuracy", cv=sk_fold, return_train_score=True)
    grid.fit(X_train, y_train)
    y_pred = grid.best_estimator_.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
    print("=====================================")
    # hold-out accuracy = (TN + TP) / all predictions
    train_scores_gmm.loc[i] = [name, grid.best_score_, (cm[0, 0] + cm[1, 1]) / (cm[0, 0] + cm[0, 1] + cm[1, 0] + cm[1, 1])]
    stack_list_gmm.append(grid.best_estimator_)
    i = i + 1
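With the loop done, a one-liner I added to rank the candidates before picking one to submit:

print(train_scores_gmm.sort_values("Test Score", ascending=False))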
est = [("g_nb", stack_list_gmm[0]),
       ("knn", stack_list_gmm[1]),
       ("ran_for", stack_list_gmm[2]),
       ("log_reg", stack_list_gmm[3]),
       ("dec_tree", stack_list_gmm[4]),
       ("XGBoost", stack_list_gmm[5]),
       ("ada_boost", stack_list_gmm[6]),
       ("grad_boost", stack_list_gmm[7]),
       ("hist_grad_boost", stack_list_gmm[8])]
Finally, as usual, I add stacking and build one more model.
sc = StackingClassifier(estimators=est, final_estimator=None, cv=sk_fold, passthrough=False)
sc.fit(X_train, y_train)
y_pred = sc.predict(X_test)
cm1 = confusion_matrix(y_test, y_pred)         # hold-out confusion matrix
y_pred_train = sc.predict(X_train)
cm2 = confusion_matrix(y_train, y_pred_train)  # training confusion matrix
# DataFrame.append is not in-place (and is removed in pandas 2.x), so assign the result back
train_scores_gmm = train_scores_gmm.append(
    pd.Series(["Stacking",
               (cm2[0, 0] + cm2[1, 1]) / cm2.sum(),
               (cm1[0, 0] + cm1[1, 1]) / cm1.sum()],
              index=train_scores_gmm.columns),
    ignore_index=True)
A grandmaster really is a grandmaster: any random model of his beats what useless me managed with PCA by a mile. Let's take the ones with the best test score, KNN or Logistic Regression, and try submitting.
# fit on the GMM probability features built above (the original lines reused
# yesterday's PCA frames, which looks like a copy-paste slip)
knn.fit(gmm_train, trainLabel)
y_submit = pd.DataFrame(knn.predict(gmm_test))
y_submit.columns = ['Solution']
y_submit['Id'] = np.arange(1, y_submit.shape[0] + 1)  # Kaggle Ids start at 1
y_submit = y_submit[['Id', 'Solution']]
y_submit.to_csv('./Submission_Gauss.csv', index=False)
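Before uploading, a small sanity check I added on the written file (one row per test sample, two columns):

submission = pd.read_csv('./Submission_Gauss.csv')
print(submission.shape)   # expect (len(test), 2)
print(submission.head())  # columns: Id, Solution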
Now that's a high score: 99.14%.
If I got anything wrong, tell me in the comments and I'll humbly take it on board. I'll keep working hard!